@dataknut)Please note that authorship is alphabetical. Contributions are listed below - see github for details and who to blame for what :-).
@dataknut)Free and easy to do. Raise an issue at:
Exploring covid 19 and related data.
This tweet:
Suggested something strange was happening with overall USA deaths in early 2020.:
USA deaths plot
This very brief analysis attempts to re-create some of their results from the same data.
Sourced from the downloads button of https://gis.cdc.gov/grasp/fluview/mortality.html
“The National Center for Health Statistics (NCHS) collects and disseminates the Nation’s official vital statistics. NCHS collects death certificate data from state vital statistics offices for virtually all deaths occurring in the United States. Pneumonia and influenza (P&I) deaths are identified based on ICD-10 multiple cause of death codes.”
dt <- data.table::fread(paste0(myParams$dPath, "/USA-all-data.csv"))
t <- table(dt$SEASON)
kableExtra::kable(t, caption = "Seasons used") %>% kable_styling()
| Var1 | Freq |
|---|---|
| 2015-16 | 52 |
| 2016-17 | 52 |
| 2017-18 | 52 |
| 2018-19 | 52 |
| 2019-20 | 24 |
Note that the 2020 data is incomplete.
Before we go further we’ll convert week numbers (in the data) to standardised dates. This will mean some of the dates for the weeks are not exactly right but it might make for easier plotting.
# why can't they just be published with dates??
dt[, `:=`(date, ifelse(SEASON == "2015-16" & WEEK > 39, as.Date("2015-01-01") + lubridate::weeks(WEEK),
NA))]
dt[, `:=`(date, ifelse(SEASON == "2015-16" & WEEK < 40, as.Date("2016-01-01") + lubridate::weeks(WEEK),
date))]
dt[, `:=`(date, ifelse(SEASON == "2016-17" & WEEK > 39, as.Date("2016-01-01") + lubridate::weeks(WEEK),
date))]
dt[, `:=`(date, ifelse(SEASON == "2016-17" & WEEK < 40, as.Date("2017-01-01") + lubridate::weeks(WEEK),
date))]
dt[, `:=`(date, ifelse(SEASON == "2017-18" & WEEK > 39, as.Date("2017-01-01") + lubridate::weeks(WEEK),
date))]
dt[, `:=`(date, ifelse(SEASON == "2017-18" & WEEK < 40, as.Date("2018-01-01") + lubridate::weeks(WEEK),
date))]
dt[, `:=`(date, ifelse(SEASON == "2018-19" & WEEK > 39, as.Date("2018-01-01") + lubridate::weeks(WEEK),
date))]
dt[, `:=`(date, ifelse(SEASON == "2018-19" & WEEK < 40, as.Date("2019-01-01") + lubridate::weeks(WEEK),
date))]
dt[, `:=`(date, ifelse(SEASON == "2019-20" & WEEK > 39, as.Date("2019-01-01") + lubridate::weeks(WEEK),
date))]
dt[, `:=`(date, ifelse(SEASON == "2019-20" & WEEK < 40, as.Date("2020-01-01") + lubridate::weeks(WEEK),
date))]
dt[, `:=`(date, lubridate::as_date(date))]
# check coding head(dt[WEEK == 1])
t <- head(dt[WEEK == 1 | WEEK == 52, .(SEASON, WEEK, date)])
kableExtra::kable(t, captopn = "Check we coded the weeks correctly") %>% kable_styling()
| SEASON | WEEK | date |
|---|---|---|
| 2019-20 | 52 | 2019-12-31 |
| 2019-20 | 1 | 2020-01-08 |
| 2018-19 | 52 | 2018-12-31 |
| 2018-19 | 1 | 2019-01-08 |
| 2017-18 | 52 | 2017-12-31 |
| 2017-18 | 1 | 2018-01-08 |
# also clean up data where needed
dt[, `:=`(totDeaths, as.numeric(gsub(",", "", `TOTAL DEATHS`)))]
dt[, `:=`(pneuDeaths, as.numeric(gsub(",", "", `NUM PNEUMONIA DEATHS`)))]
dt[, `:=`(fluDeaths, as.numeric(gsub(",", "", `NUM INFLUENZA DEATHS`)))]
dt$V13 <- NULL
dt$V14 <- NULL
dt$V15 <- NULL
dt$V16 <- NULL
dt$V17 <- NULL
dt$V18 <- NULL
skimr::skim(dt)
| Name | dt |
| Number of rows | 232 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| Date | 1 |
| logical | 1 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| AREA | 0 | 1 | 8 | 8 | 0 | 1 | 0 |
| AGE GROUP | 0 | 1 | 3 | 3 | 0 | 1 | 0 |
| SEASON | 0 | 1 | 7 | 7 | 0 | 5 | 0 |
| NUM INFLUENZA DEATHS | 0 | 1 | 1 | 5 | 0 | 135 | 0 |
| NUM PNEUMONIA DEATHS | 0 | 1 | 5 | 5 | 0 | 220 | 0 |
| TOTAL DEATHS | 0 | 1 | 6 | 6 | 0 | 228 | 0 |
| PERCENT COMPLETE | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2015-10-08 | 2020-03-18 | 2017-12-27 | 232 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| SUB AREA | 232 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| WEEK | 0 | 1 | 26.62 | 15.67 | 1.0 | 12.00 | 27.0 | 41.00 | 52.0 | ▇▆▆▆▇ |
| THRESHOLD | 0 | 1 | 6.78 | 0.78 | 5.4 | 6.07 | 6.8 | 7.40 | 8.2 | ▆▆▅▇▅ |
| BASELINE | 0 | 1 | 6.41 | 0.77 | 5.1 | 5.70 | 6.4 | 7.00 | 7.8 | ▇▆▇▇▆ |
| PERCENT P&I | 0 | 1 | 6.56 | 1.12 | 4.9 | 5.70 | 6.2 | 7.32 | 10.9 | ▇▅▃▁▁ |
| totDeaths | 0 | 1 | 53688.44 | 3467.98 | 37641.0 | 51073.50 | 53119.5 | 55809.75 | 67495.0 | ▁▁▇▃▁ |
| pneuDeaths | 0 | 1 | 3380.11 | 591.94 | 2431.0 | 2924.50 | 3214.5 | 3782.50 | 5583.0 | ▇▇▅▁▁ |
| fluDeaths | 0 | 1 | 167.95 | 275.90 | 2.0 | 13.75 | 31.5 | 257.50 | 1626.0 | ▇▂▁▁▁ |
tdt <- dt[`PERCENT COMPLETE` != "> 100%"]
kableExtra::kable(table(tdt$date, tdt$`PERCENT COMPLETE`), caption = "Incomplete data...") %>%
kable_styling()
| 79.10% | |
|---|---|
| 2020-03-18 | 1 |
dt <- dt[`PERCENT COMPLETE` == "> 100%"]
We remove the incomplete data from further analysis.
Check on total number of deaths per season.
t <- dt[, .(Total = dkUtils::tidyNum(sum(totDeaths)), Pneumonia = dkUtils::tidyNum(sum(pneuDeaths)),
Influenza = dkUtils::tidyNum(sum(fluDeaths))), keyby = .(SEASON)]
kableExtra::kable(t, caption = "Sum of deaths by season - do these look sensible?",
) %>% kable_styling()
| SEASON | Total | Pneumonia | Influenza |
|---|---|---|---|
| 2015-16 | 2,697,072 | 178,002 | 3,448 |
| 2016-17 | 2,790,317 | 179,621 | 6,954 |
| 2017-18 | 2,835,739 | 180,132 | 15,620 |
| 2018-19 | 2,829,403 | 168,422 | 7,171 |
| 2019-20 | 1,265,545 | 75,578 | 5,406 |
Figure 4.1 is the one causing the interest. Note that we have set the y axis to start at zero to avoid over-emphasising the slope. Week 10 is roughly:
dt[WEEK == 10, .(SEASON, WEEK, date)]
## SEASON WEEK date
## 1: 2019-20 10 2020-03-11
## 2: 2018-19 10 2019-03-12
## 3: 2017-18 10 2018-03-12
## 4: 2016-17 10 2017-03-12
## 5: 2015-16 10 2016-03-11
week10Text <- paste0("Week 10: ", dt[WEEK == 10 & SEASON == "2019-20", (date)])
addWeek10Label <- function(p) {
p <- p + annotate("text", x = 10.5, y = yMax * 0.4, angle = 10, size = 3, label = week10Text,
hjust = 0)
p <- p + geom_vline(xintercept = 10, colour = "#CC79A7", alpha = myParams$vLineAlpha)
return(p)
}
p <- ggplot2::ggplot(dt, aes(y = totDeaths, x = WEEK, colour = SEASON, group = SEASON)) +
geom_point() + ylim(0, NA) + labs(x = "Week", y = "Total deaths", caption = "Plot by @dataknut using data from https://gis.cdc.gov/grasp/fluview/mortality.html\nIncomplete data excluded")
yMax <- max(dt$totDeaths)
p <- addWeek10Label(p)
p
Figure 4.1: USA recorded deaths by flu season (all deaths)
Figure 4.2: interactive version (hover the mouse over a dot).
plotly::ggplotly(p)
Figure 4.2: USA recorded deaths by flu season (all deaths)
Figure 5.1 is pneumonia deaths. They are trending downwards too.
p <- ggplot2::ggplot(dt, aes(y = pneuDeaths, x = WEEK, colour = SEASON, group = SEASON)) +
geom_point() + ylim(0, NA) + labs(x = "Week", y = "Pneumonia deaths", caption = "Plot by @dataknut using data from https://gis.cdc.gov/grasp/fluview/mortality.html")
yMax <- max(dt$pneuDeaths)
p <- addWeek10Label(p)
p
Figure 5.1: USA recorded Pneumonia deaths by flu season
The ‘missing’ deaths here don’t seem to account for the overall total deaths reduction. It seems that pneumonia deaths are not being re-classed as covid since USA covid deaths are not recorded as starting until after March 10th. Although it is possible that earlier covid deaths were not recorded as covid or pneuomina (or influenza).
Figure 6.1 is influenza deaths.
p <- ggplot2::ggplot(dt, aes(y = fluDeaths, x = WEEK, colour = SEASON, group = SEASON)) +
geom_point() + ylim(0, NA) + labs(x = "Week", y = "Influenza deaths", caption = "Plot by @dataknut using data from https://gis.cdc.gov/grasp/fluview/mortality.html")
yMax <- max(dt$fluDeaths)
p <- addWeek10Label(p)
p
Figure 6.1: USA recorded Influenza deaths by flu season
Yep, 2017-2018 was a bad flu year in the USA…
Report generated using knitr in RStudio with R version 3.6.3 (2020-02-29) running on x86_64-apple-darwin15.6.0 (Darwin Kernel Version 19.4.0: Wed Mar 4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64).
t <- proc.time() - myParams$startTime
elapsed <- t[[3]]
Analysis completed in 3.99 seconds ( 0.07 minutes).
R packages used:
Arino de la Rubia, Eduardo, Hao Zhu, Shannon Ellis, Elin Waring, and Michael Quinn. 2017. Skimr: Skimr. https://github.com/ropenscilabs/skimr.
Dowle, M, A Srinivasan, T Short, S Lianoglou with contributions from R Saporta, and E Antonyan. 2015. Data.table: Extension of Data.frame. https://CRAN.R-project.org/package=data.table.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. http://www.jstatsoft.org/v40/i03/.
Müller, Kirill. 2017. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.
Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, and Pedro Despouy. 2016. Plotly: Create Interactive Web Graphics via ’Plotly.js’. https://CRAN.R-project.org/package=plotly.
Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
Zhu, Hao. 2018. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.